The data contains features extracted from the silhouettes of vehicles viewed at different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
Object recognition
The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
All the features are geometric features extracted from the silhouette. All are numeric in nature.
Objectives:
1. Exploratory Data Analysis
2. Reduce the number of dimensions in the dataset with minimal information loss
3. Train a model using principal components

Apply a dimensionality-reduction technique (PCA) and train a model on the principal components instead of training it on the raw data alone.
#Import all the necessary modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats
from sklearn import tree
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from warnings import simplefilter # import warnings filter
simplefilter(action='ignore', category=FutureWarning) # ignore all future warnings
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load dataset
df= pd.read_csv("vehicle-1.csv")
df.head(10)
df.shape
df.info()
# All columns have numerical values
# 'class' is the target variable; it should be excluded before PCA is applied
df.isna().sum()
There are various ways to handle missing values: drop the affected rows, replace missing values with the median, etc. From the above we can see that the data contains NaN (or 0) entries in some columns. We could drop those rows, which might not be a good idea in all situations; here, we will replace them with the column medians.
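The trade-off between the two strategies can be seen on a tiny synthetic frame (a sketch, not the vehicle data): dropping loses rows, median imputation keeps them all.

```python
import numpy as np
import pandas as pd

# a toy frame with one missing value in each column
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0, 100.0],
                     "y": [10.0, 20.0, np.nan, 40.0]})

dropped = demo.dropna()              # option 1: lose two of the four rows
filled = demo.fillna(demo.median())  # option 2: keep all rows, fill per-column median
print(len(dropped), len(filled))
print(filled.loc[1, "x"], filled.loc[2, "y"])
```

Note that `df.median()` is computed per column, so each gap is filled with its own column's median.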
df['skewness_about'].unique()
df = df.replace('0', np.nan) # treat the string '0' (seen in the unique() check above) as missing
# replace the missing values with median value.
# Note, we do not need to specify the column names below
# every column's missing value is replaced with that column's median respectively
df = df.fillna(df.median(numeric_only=True)) # numeric_only avoids errors on the string 'class' column
df.isna().sum()
df.describe().transpose()
Observations: 'compactness' and 'circularity' have nearly identical mean and median values, which suggests they are roughly normally distributed with little skewness and few outliers. The 'scatter_ratio' feature seems to have some skewness and outliers.
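The mean-vs-median heuristic used above can also be checked numerically; a small sketch on synthetic data (not the vehicle dataset) showing that a symmetric column has mean close to the median and near-zero skewness, while a skewed one does not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(100, 5, size=1000),       # mean close to median
    "right_skewed": rng.exponential(5.0, size=1000),  # mean pulled above median
})

# a large |mean - median| gap relative to the std suggests skew;
# pandas also provides a direct sample-skewness estimate via .skew()
for col in demo.columns:
    gap = abs(demo[col].mean() - demo[col].median()) / demo[col].std()
    print(col, round(gap, 3), round(demo[col].skew(), 3))
```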
# Check for duplicate data
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
# Label Encoding
le=preprocessing.LabelEncoder()
df['class']=le.fit_transform(df['class'])
print(df.shape)
df.head()
A bivariate analysis among the different independent variables can be done using a scatter-matrix plot. Seaborn's `pairplot` creates a grid of pairwise plots that reflects useful information about the dimensions.
# Check for correlation of variable
df.corr(method='pearson')
fig=plt.figure(figsize=(15,12))
sns.heatmap(df.corr(),annot=True)#correlation function
1. 'circularity' is highly correlated with 'max.length_rectangularity' and 'scaled_radius_of_gyration' (0.96 and 0.93).
2. 'scatter_ratio' is highly correlated with 'pr.axis_rectangularity', 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration' and 'distance_circularity'.
3. 'pr.axis_rectangularity' is positively correlated with 'scaled_variance', 'scaled_variance.1' and 'scaled_radius_of_gyration'.
4. 'scaled_variance' and 'scaled_variance.1' are also positively correlated.
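Instead of reading the pairs off the heatmap, highly correlated pairs can be listed programmatically. A sketch on a small synthetic frame; the same pattern applies directly to `df.corr()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=500),  # strongly correlated with a
    "c": rng.normal(size=500),                     # independent noise
})

corr = demo.corr(method="pearson").abs()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high = upper.stack()
high = high[high > 0.9].sort_values(ascending=False)
print(high)
```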
sns.pairplot(df, diag_kind='kde') # to plot density curve instead of histogram on the diag
# Remove any outliers and standardize variables in the pre-processing step
df.boxplot(figsize=(20,3))
# We can see a few outliers here. One option is to drop rows whose z-score exceeds a threshold:
plt.boxplot(df['radius_ratio'])
z=np.abs(stats.zscore(df))
print(z)
threshold=3
print(np.where(z>3))
print(z[4][4]) #z score higher than 3
df=df[(z<3).all(axis=1)]
df.shape #new shape
df.boxplot(figsize=(20,3))
# We could see most of the outliers are now removed.
plt.boxplot(df['radius_ratio'])
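Z-score trimming drops whole rows. An alternative that preserves the row count is IQR-based capping (winsorizing), which clips values to the boxplot whiskers instead; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
s = pd.Series(rng.normal(50, 5, size=200))
s.iloc[0] = 500.0  # inject an extreme outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# cap values outside the whiskers instead of dropping the rows
capped = s.clip(lower=lower, upper=upper)
print(s.max(), "->", capped.max())
```

The choice between trimming and capping depends on whether the downstream model can afford to lose observations.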
#Let us break the X and y dataframes into training set and test set. For this we will use
#Sklearn package's data splitting function which is based on random function
from sklearn.model_selection import train_test_split
#now separate the dataframe into dependent and independent variables
#print("shape of new_vehicle_df_independent_attr::",X.shape)
#print("shape of new_vehicle_df_dependent_attr::",y.shape)
X = df.iloc[:,0:18].values
y = df.iloc[:,18].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30, random_state=10)
from sklearn import svm
clr = svm.SVC(gamma='scale')
clr.fit(X_train , y_train)
y_predict = clr.predict(X_test)
# Calculation of accuracy
print("Accuracy on training set: {:.4f}".format(clr.score(X_train, y_train)))
print("Accuracy on test set: {:.4f}".format(clr.score(X_test, y_test)))
model_score = clr.score(X_test, y_test)
print(model_score)
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.DataFrame({'Method':['svm(raw data)'], 'accuracy': model_score})
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
num_folds = 50
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed) # random_state requires shuffle=True
model = LogisticRegression(max_iter=1000) # raise max_iter so the solver converges on unscaled data
results = cross_val_score(model, X, y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
kfmodel_score = results.mean()
print(kfmodel_score)
tempResultsDf = pd.DataFrame({'Method':['K-fold cross val(raw data)'], 'accuracy':kfmodel_score })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
# Drop the class (target) variable
df_new =df.drop(['class'], axis =1)
df_new.head()
# Scaling The Independent Data Set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df_scaled = sc.fit_transform(df_new)
df_scaled
covMatrix = np.cov(df_scaled,rowvar=False)
print(covMatrix)
covMatrix.shape
# Step 2 - Get eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(covMatrix)
print('Eigen Vectors \n%s' % eig_vecs)
print('\nEigen Values \n%s' % eig_vals)
# Step 3 - Find the variance and cumulative variance explained by each eigenvector
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
# Make a set of (eigenvalue, eigenvector) pairs:
eig_pairs = [(eig_vals[index], eig_vecs[:,index]) for index in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest with respect to eigenvalue
eig_pairs.sort()
eig_pairs.reverse()
print(eig_pairs)
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eig_vals))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eig_vals))]
# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)
# Ploting
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
Observations: From the above plot we can clearly observe that 8 dimensions are able to explain 95% of the variance in the data, so we will use the first 8 principal components going forward and compute the reduced dimensions.
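The 95% cut-off can be computed rather than read off the plot. A self-contained sketch on synthetic low-rank data (the same two lines at the end apply to the `cum_var_exp` computed above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# low-rank synthetic data: 5 strong directions embedded in 18 dimensions
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 18))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 18))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_) * 100
# smallest k whose cumulative explained variance reaches 95%
k = int(np.argmax(cum >= 95)) + 1
print(k, round(cum[k - 1], 2))
```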
pca = PCA(n_components=8)
pca.fit(df_scaled)
df_pca=pca.transform(df_scaled)
pca.components_
df_scaled.shape
df_pca.shape
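As a sanity check, `pca.transform` is (up to a per-component sign flip) the projection of the centered data onto the leading eigenvectors of the covariance matrix, i.e. sklearn agrees with the manual eigendecomposition route above. A self-contained sketch on synthetic standardized data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = StandardScaler().fit_transform(rng.normal(size=(100, 6)))

# manual route: eigendecomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)        # eigh: for symmetric matrices, ascending order
order = np.argsort(vals)[::-1]
top2 = vecs[:, order[:2]]
manual = X @ top2

# sklearn route
sk = PCA(n_components=2).fit(X).transform(X)

# eigenvectors are defined only up to sign, so compare column-wise up to a flip
same = [np.allclose(manual[:, i], sk[:, i]) or np.allclose(manual[:, i], -sk[:, i])
        for i in range(2)]
print(same)
```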
# Fitting SVM ON PCA Data
# Split the data into train and test
X1_train, X1_test, y1_train, y1_test = train_test_split(df_pca,y,test_size=0.30, random_state=10)
clr1 = svm.SVC(gamma='scale')
clr1.fit(X1_train , y1_train)
x_pca_predict = clr1.predict(X1_test)
print("Accuracy on training set: {:.4f}".format(clr1.score(X1_train, y1_train)))
print("Accuracy on test set: {:.4f}".format(clr1.score(X1_test, y1_test)))
model_score1 = clr1.score(X1_test, y1_test)
print(model_score1)
tempResultsDf = pd.DataFrame({'Method':['svm(pca data)'], 'accuracy': [model_score1]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
# K-fold cross validation on PCA data
num_folds = 50
seed = 7
kfold1 = KFold(n_splits=num_folds, shuffle=True, random_state=seed) # random_state requires shuffle=True
model1 = LogisticRegression(max_iter=1000) # raise max_iter so the solver converges
results = cross_val_score(model1, df_pca, y, cv=kfold1)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
kfmodel_score1 = results.mean()
print(kfmodel_score1)
tempResultsDf = pd.DataFrame({'Method':['K-fold cross val(pca data)'], 'accuracy':kfmodel_score1})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
Observations: The final `resultsDf` table compares all four models. The PCA-based models use only the first 8 principal components rather than the original 18 features, so comparable accuracy at this reduced dimensionality would confirm that PCA preserved most of the information in the data.